Exploratory Data Analysis: Wisconsin Diagnostic Breast Cancer (WDBC)¶
1.1 Introduction¶
This report analyzes the Wisconsin Diagnostic Breast Cancer (WDBC) dataset to identify key features distinguishing malignant from benign tumors. The data features were computed from digitized images of fine needle aspirates (FNA) of breast masses, describing the characteristics of the cell nuclei present in the image (Wolberg et al., 1995).
1.2 Data Acquisition¶
The raw data was retrieved directly from the UCI Machine Learning Repository to ensure reproducibility. The dataset consists of 569 instances with 30 real-valued input features and one binary target variable (Diagnosis).
# Imports (only those necessary for EDA)
import pandas as pd
import numpy as np
import altair_ally as aly
import altair as alt
alt.data_transformers.enable('vegafusion')
from ucimlrepo import fetch_ucirepo
# import the data
# Code from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
# Need ucimlrepo package to load the data
raw_data = fetch_ucirepo(id=17)
raw_X = raw_data.data.features
raw_y = raw_data.data.targets
raw_df = pd.concat([raw_X, raw_y], axis=1)
raw_df.to_csv("../data/raw/breast_cancer_raw.csv", index=False)
2. Data Cleaning, Schema Mapping, and Data Validation¶
The raw dataset lacks semantic column headers. To facilitate analysis, we implemented a schema mapping strategy based on the wdbc.names metadata. The 30 features represent ten distinct cell nucleus characteristics (e.g., Radius, Texture) computed in three statistical forms.
We applied the following suffix mapping transformation:
- Suffix `1` → `_mean` (Mean Value)
- Suffix `2` → `_se` (Standard Error)
- Suffix `3` → `_max` (Worst / Max Value)
This step ensures all features are semantically interpretable for the subsequent EDA.
To ensure the dataset is clean, consistent, and ready for modeling, we validated the data by implementing the following checks:
**Correct Data File Format.** The cleaned dataset was exported as a standard CSV (`breast_cancer_cleaned.csv`) with UTF-8 encoding. It loaded in pandas without errors, confirming proper file format and readability.

**Correct Column Names.** All 30 feature columns follow the expected naming convention (`_mean`, `_se`, `_max`). The `Diagnosis` column contains only "Benign" and "Malignant" values, and the total number of columns is 31, as expected.

**No Empty Observations.** All rows contain complete observations: there are no fully empty rows, and no partial missing values were detected.

**Missingness Within Expected Threshold.** A threshold of 5% missingness per column was applied. No column exceeded this limit, so all features are sufficiently complete for reliable modeling.
By combining schema mapping with these validation checks, the dataset is fully consistent, correctly formatted, and reproducible, providing a robust foundation for downstream modeling and analysis.
# Attempt loading the file to ensure it’s a valid CSV [Ensuring correct Data File Format]
try:
    df = pd.read_csv('../data/processed/breast_cancer_cleaned.csv')
    print("File loaded successfully. Format OK.")
except Exception as e:
    raise AssertionError(f"File format error: {e}")
# Ensure no unnamed index column is within the data
assert not any(df.columns.str.contains("Unnamed")), \
    "Error: Unnamed index column detected!"
# Clean the column names based on description
clean_columns = []
for col in raw_X.columns:
    if col.endswith('1'):
        clean_name = col[:-1] + '_mean'
    elif col.endswith('2'):
        clean_name = col[:-1] + '_se'
    elif col.endswith('3'):
        clean_name = col[:-1] + '_max'
    else:
        clean_name = col
    clean_columns.append(clean_name)
raw_X.columns = clean_columns
X = raw_X.copy()
# Clean the target column
y = raw_y.copy()
y['Diagnosis'] = y['Diagnosis'].map({'M': 'Malignant', 'B': 'Benign'})
clean_df = pd.concat([X, y], axis=1)
# Must be 31 columns: 30 features + Diagnosis
assert clean_df.shape[1] == 31, f"Unexpected number of columns: {clean_df.shape[1]}"
# Check naming pattern
allowed_suffixes = ("_mean", "_se", "_max")
feature_cols = [c for c in clean_df.columns if c != "Diagnosis"]
# All feature columns must end with one of the suffixes
for col in feature_cols:
    assert col.endswith(allowed_suffixes), f"Invalid column name: {col}"
# Diagnosis must contain valid labels
assert clean_df["Diagnosis"].isin(["Benign", "Malignant"]).all(), "Invalid Diagnosis values detected"
print("Column names OK")
empty_rows = df.isna().all(axis=1).sum()
assert empty_rows == 0, f"Found {empty_rows} completely empty rows!"
print("No empty observations.")
partial_missing = df.isna().any(axis=1).sum()
print(f"Rows with any missing values: {partial_missing}")
threshold = 0.05
missing_ratio = df.isna().mean()
too_high = missing_ratio[missing_ratio > threshold]
assert too_high.empty, f"Columns exceeding missingness threshold:\n{too_high}"
print("Missingness below threshold.")
# Export the cleaned data
clean_df.to_csv('../data/processed/breast_cancer_cleaned.csv', index=False)
clean_df
File loaded successfully. Format OK.
Column names OK
No empty observations.
Rows with any missing values: 0
Missingness below threshold.
| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean | ... | texture_max | perimeter_max | area_max | smoothness_max | compactness_max | concavity_max | concave_points_max | symmetry_max | fractal_dimension_max | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | Malignant |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | Malignant |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | Malignant |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | Malignant |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | Malignant |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | Malignant |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | Malignant |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | Malignant |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | Malignant |
| 568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | 0.05884 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 | Benign |
569 rows × 31 columns
# Data Validation (5-8)
# 5. Correct data types in each column
feature_cols = [c for c in clean_df.columns if c != 'Diagnosis']
assert clean_df[feature_cols].select_dtypes(include=['number']).shape[1] == len(feature_cols), \
"Error: Non-numeric features detected."
# 6. No duplicate observations
n_duplicates = clean_df.duplicated().sum()
assert n_duplicates == 0, \
f"Error: {n_duplicates} duplicate observations detected."
# 7. No outlier or anomalous values
suffixes = ['_mean', '_se', '_max']
charts = []
for suffix in suffixes:
    cols = [c for c in clean_df.columns if c.endswith(suffix) and c != 'Diagnosis']
    # The range of the 'Area' features (values ~1000+) exceeds the other features (values < 1).
    # We use 'symlog' (symmetric log) to handle potential 0 values in features like 'Concavity'.
    chart = alt.Chart(clean_df).mark_boxplot(extent=1.5).encode(
        x=alt.X('value:Q', title='Value (Symlog)', scale=alt.Scale(type='symlog')),
        y=alt.Y('variable:N', title='Feature'),
        color=alt.Color('Diagnosis:N', title='Diagnosis'),
        tooltip=['variable:N', 'value:Q', 'Diagnosis']
    ).transform_fold(
        cols,
        as_=['variable', 'value']
    ).properties(
        title=f'Distribution of {suffix} Features',
        width=400,
        height=300
    )
    charts.append(chart)
display(alt.vconcat(*charts).resolve_scale(x='independent'))
Outlier Analysis & Scaling Strategy¶
Due to the massive disparity in magnitude between features (e.g., Area > 2000 vs. Smoothness < 0.2), a Symmetric Log (Symlog) scale was applied to the visualizations. This effectively mitigates the compressing effect of the Area feature, allowing the distribution and spread of smaller-scale variables to be clearly observed without losing the information from larger values.
Post-scaling inspection reveals numerous outliers (points beyond whiskers), particularly in Malignant samples (Orange).
- Significance: These are not data errors. In the context of breast cancer, extreme values in features like `Area`, `Concavity`, and `Perimeter` are characteristic of malignant tumor growth.
- Conclusion: These points represent high-priority biological signals essential for classification.
Preprocessing Recommendation
- Action: Do not drop these outliers, as removing them would discard critical diagnostic information.
- Strategy: To handle the skewness and scale differences during modeling:
  - Apply Log Transformation (`np.log1p`) to right-skewed features (`Area`, `Perimeter`) to normalize distributions.
  - Apply Standard Scaling (`StandardScaler`) to all features so the model treats all dimensions with equal weight.
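The two-step strategy above can be sketched in isolation. The snippet below is a minimal illustration, not the report's actual pipeline: it applies `np.log1p` followed by `StandardScaler` to a synthetic right-skewed column standing in for a feature like `area_mean`.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic right-skewed column standing in for area_mean (illustrative only).
rng = np.random.default_rng(0)
skewed = pd.DataFrame({"area_like": rng.lognormal(mean=6, sigma=0.5, size=200)})

# Step 1: log1p compresses the long right tail.
# Step 2: StandardScaler centres the result to mean 0, variance 1.
log_then_scale = make_pipeline(
    FunctionTransformer(np.log1p, validate=True),
    StandardScaler(),
)
transformed = log_then_scale.fit_transform(skewed)
print("mean:", transformed.mean(), "std:", transformed.std())
```

Wrapping both steps in one pipeline keeps the transformation fit on training data only, which avoids leakage when the same object is reused at prediction time.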
# 8. Correct category levels (i.e., no string mismatches or single values)
target_counts = clean_df['Diagnosis'].value_counts(dropna=False)
## No Single category
assert len(target_counts) > 1, \
f"Error: Single category detected. Only found: {target_counts.index.tolist()}"
## At least 2 samples per category
min_samples = target_counts.min()
assert min_samples > 1, \
f"Error: Found a category with a single observation! Min samples: {min_samples}"
## No unexpected category labels
expected_classes = {'Malignant', 'Benign'}
actual_classes = set(target_counts.index)
assert actual_classes == expected_classes, \
f"Error: Unexpected category labels found! Found: {actual_classes}, Expected: {expected_classes}"
- Target/response variable follows expected distribution
  - We validate that the target variable `Diagnosis` is not severely imbalanced. If one class is much rarer than the other, this can hurt model performance and may require special handling (for example, resampling or adjusting evaluation metrics).
- No anomalous correlations between target and features
  - We check how strongly each feature is associated with `Diagnosis`. Extremely high predictive power for a single feature can indicate data leakage or unexpected dependencies that should be investigated.
- No anomalous correlations between features
  - We examine pairwise correlations between features. If many feature pairs are highly correlated, this suggests redundancy or multicollinearity, which may require feature selection or dimensionality reduction.
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import ClassImbalance, FeatureLabelCorrelation
from deepchecks.tabular.checks.data_integrity import FeatureFeatureCorrelation
bc_dataset = Dataset(
clean_df,
label='Diagnosis'
)
# 9. Target/response variable follows expected distribution
class_imbalance_check = ClassImbalance().add_condition_class_ratio_less_than(
class_imbalance_ratio_th=0.2 # flag if minority / majority < 0.2
)
class_imbalance_result = class_imbalance_check.run(bc_dataset)
class_imbalance_result
# 10. Feature–target correlations
flc_check = FeatureLabelCorrelation().add_condition_feature_pps_less_than(
threshold=0.8 # flag features that are *too* predictive of the label
)
flc_result = flc_check.run(bc_dataset)
flc_result
# 11. Feature–feature correlations
ffc_check = FeatureFeatureCorrelation().add_condition_max_number_of_pairs_above_threshold(
0.95,
10
)
ffc_result = ffc_check.run(bc_dataset)
ffc_result
The FeatureFeatureCorrelation check fails the condition we set. Our data contains many feature pairs with very high correlation, for example `radius_mean`, `perimeter_mean`, and `area_mean`, as well as their corresponding `_max` and `_se` versions. This pattern is expected for this dataset: these variables all describe related geometric properties of the tumor, so the strong correlations are not a data quality error but a sign of redundancy and multicollinearity.
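As a sanity check, highly correlated pairs can also be listed directly from the correlation matrix. The snippet below is an illustrative sketch on synthetic data (a `radius`-like feature and a linearly derived `perimeter`-like feature), not a rerun of the deepchecks condition:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for clean_df: two geometrically linked features
# plus one independent feature.
rng = np.random.default_rng(42)
radius = rng.normal(14, 3, 300)
demo = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0, 1, 300),
    "texture_mean": rng.normal(19, 4, 300),
})

# Keep only the upper triangle so each pair is counted once.
corr = demo.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.95]
print(high_pairs)
```

Applied to the real `clean_df`, the same pattern would surface the radius/perimeter/area family flagged by the deepchecks check.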
3. Data Profiling: Structure and Statistics¶
Purpose:
- `df.info()`: verifies data integrity by checking for null values and confirming all feature columns are of `float64` type.
- `df.describe()`: examines the central tendency and spread of the numeric features, highlighting differences in magnitude (scale) across variables.
Observation:
The dataset is complete (no missing values). However, describe() reveals massive scale disparities (e.g., area_mean ranges up to 2500, while smoothness_mean is < 0.1), confirming the necessity for Feature Scaling (Standardization) before modeling.
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   radius_mean             569 non-null    float64
 1   texture_mean            569 non-null    float64
 2   perimeter_mean          569 non-null    float64
 3   area_mean               569 non-null    float64
 4   smoothness_mean         569 non-null    float64
 5   compactness_mean        569 non-null    float64
 6   concavity_mean          569 non-null    float64
 7   concave_points_mean     569 non-null    float64
 8   symmetry_mean           569 non-null    float64
 9   fractal_dimension_mean  569 non-null    float64
 10  radius_se               569 non-null    float64
 11  texture_se              569 non-null    float64
 12  perimeter_se            569 non-null    float64
 13  area_se                 569 non-null    float64
 14  smoothness_se           569 non-null    float64
 15  compactness_se          569 non-null    float64
 16  concavity_se            569 non-null    float64
 17  concave_points_se       569 non-null    float64
 18  symmetry_se             569 non-null    float64
 19  fractal_dimension_se    569 non-null    float64
 20  radius_max              569 non-null    float64
 21  texture_max             569 non-null    float64
 22  perimeter_max           569 non-null    float64
 23  area_max                569 non-null    float64
 24  smoothness_max          569 non-null    float64
 25  compactness_max         569 non-null    float64
 26  concavity_max           569 non-null    float64
 27  concave_points_max      569 non-null    float64
 28  symmetry_max            569 non-null    float64
 29  fractal_dimension_max   569 non-null    float64
 30  Diagnosis               569 non-null    object
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
clean_df.describe()
| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_max | texture_max | perimeter_max | area_max | smoothness_max | compactness_max | concavity_max | concave_points_max | symmetry_max | fractal_dimension_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
| 25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
| 50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
| 75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
| max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 30 columns
4. Correlation Analysis: Pearson vs. Spearman¶
Method:
- Pearson Correlation: Measures linear relationships.
- Spearman Correlation: Measures monotonic rank relationships (non-linear). Comparing both helps identify if relationships are strictly linear or just trending in the same direction.
Purpose: To detect Multicollinearity—redundant features that increase model complexity without adding information.
Results:
Both metrics show near-perfect correlation ($>0.95$) between Radius, Perimeter, and Area. This confirms these features are geometrically redundant. We should retain only one (e.g., Radius) and drop the others to improve model stability.
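The difference between the two metrics can be demonstrated on the geometric relationship itself: Area grows quadratically with Radius, so the link is perfectly monotonic but not linear. A small illustration with synthetic radii (the real columns are not reloaded here):

```python
import numpy as np
import pandas as pd

# Synthetic radii spanning roughly the observed range (illustrative).
rng = np.random.default_rng(1)
demo = pd.DataFrame({"radius": rng.uniform(7, 28, 300)})
demo["area"] = np.pi * demo["radius"] ** 2  # quadratic, hence non-linear

pearson = demo["radius"].corr(demo["area"])                      # linear association
spearman = demo["radius"].corr(demo["area"], method="spearman")  # rank association

# The relationship is perfectly monotonic, so Spearman is ~1.0, while
# Pearson falls slightly short because the curve is not a straight line.
print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```

When both coefficients are near 1 for a feature pair, as in our heatmaps, the pair is redundant regardless of whether the link is strictly linear.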
# Multicollinearity
corr_chart = aly.corr(clean_df)
corr_chart.save('../results/images/corr_chart.png')
corr_chart.save('../results/images/corr_chart.svg')
corr_chart
5. Pairwise Separability Analysis¶
Purpose: To visualize 2D decision boundaries. We look for feature combinations where the Benign (Blue) and Malignant (Orange) clusters are clearly distinct with minimal overlap.
Results:
- High Separability: Features related to size (`radius_mean`) and shape complexity (`concavity_mean`) separate the classes well.
- Non-linear patterns: The curved relationship between `area` and `radius` is clearly visible, reinforcing the geometric redundancy found in the correlation analysis.
# Only include the _mean features, as they provide most of the information
cols_mean = [c for c in clean_df.columns if '_mean' in c] + ['Diagnosis']
pair_chart = aly.pair(clean_df[cols_mean], color='Diagnosis:N')
pair_chart.save('../results/images/pair_chart.png')
pair_chart.save('../results/images/pair_chart.svg')
pair_chart
6. Distribution Analysis¶
Purpose: To inspect the univariate "shape" of the data. We look for Skewness (asymmetry) and Outliers that could bias linear models.
Results:
- Skewness: Features like `area_se` and `concavity_mean` are heavily right-skewed (long tail to the right). This indicates that a log transformation is required to normalize these distributions.
- Overlap: "Texture" and "Smoothness" show high overlap between classes, suggesting they are less informative on their own compared to "Size" features.
dist_chart = aly.dist(clean_df, color='Diagnosis')
dist_chart.save('../results/images/dist_chart.png')
dist_chart.save('../results/images/dist_chart.svg')
dist_chart
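Skewness can also be quantified numerically rather than read off the histograms. The sketch below uses a synthetic lognormal column as a stand-in for `area_se` (the column name is illustrative, not from the cleaned dataset) and shows how `np.log1p` pulls the pandas `.skew()` value back toward zero:

```python
import numpy as np
import pandas as pd

# Synthetic lognormal column standing in for area_se (illustrative only).
rng = np.random.default_rng(7)
demo = pd.DataFrame({"area_se_like": rng.lognormal(mean=3, sigma=0.8, size=500)})

# .skew() is ~0 for symmetric data and strongly positive for a long right tail.
skew_before = demo["area_se_like"].skew()
skew_after = np.log1p(demo["area_se_like"]).skew()

print(f"skew before log1p: {skew_before:.2f}")
print(f"skew after  log1p: {skew_after:.2f}")
```

A rule of thumb is to consider transforming features with |skew| above roughly 1; the same computation on `clean_df` would flag the Area and Concavity families noted above.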
EDA Findings¶
- Class Separation:
  - High Separability: Features related to size (`radius`, `perimeter`, `area`) and concavity (`concave_points`, `concavity`) show clear distinction between the Benign and Malignant classes (Malignant samples generally have higher values).
  - Low Separability: Texture, Smoothness, and Fractal Dimension show significant overlap, indicating they are weaker individual predictors.
- Distributions:
  - Skewness: "Area" and "Concavity" features (both `_mean` and `_se`) are heavily right-skewed.
  - Outliers: Visible in the upper tails of `area_max` and `perimeter_se`.
- Correlations (Multicollinearity):
  - Severe Multicollinearity: `radius`, `perimeter`, and `area` are almost perfectly correlated ($R \approx 1$). This is expected geometrically but redundant for models.
  - `concavity`, `concave_points`, and `compactness` also exhibit very high positive correlation.
Preprocessing Recommendations¶
Based on the above, the following pipeline is suggested:
- Feature Selection / Drop:
  - Remove redundant features to reduce multicollinearity: keep `radius` and drop `area` and `perimeter`, as they duplicate its information.
- Transformation:
  - Apply a log transformation (`np.log1p`) to skewed features (e.g., `area`, `concavity`) to normalize distributions.
- Scaling:
  - Features vary vastly in scale (e.g., `area` > 1000 vs. `smoothness` < 0.2). Use `StandardScaler` to standardize all features to zero mean and unit variance.
- Imputation:
  - None needed (the data is complete).
Onto Creating a Classification Model¶
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X = clean_df.drop('Diagnosis', axis=1)
y = clean_df['Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train.columns
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_max', 'texture_max', 'perimeter_max',
'area_max', 'smoothness_max', 'compactness_max', 'concavity_max',
'concave_points_max', 'symmetry_max', 'fractal_dimension_max'],
dtype='object')
numeric_feats = ['radius_mean', 'texture_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_max', 'texture_max',
'smoothness_max', 'compactness_max', 'concavity_max',
'concave_points_max', 'symmetry_max', 'fractal_dimension_max']
drop_feats = [
'perimeter_mean',
'area_mean',
'perimeter_se',
'area_se',
'texture_se',
'smoothness_se',
'symmetry_se',
'perimeter_max',
'area_max'
]
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
ct = make_column_transformer(
    (StandardScaler(), numeric_feats),
    ("drop", drop_feats)
)
pipe = Pipeline([
    ("preprocess", ct),
    ("svc", SVC())
])
param_grid = {
    "svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100]
}
gs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=15,
    n_jobs=-1,
    return_train_score=True
)
gs.fit(X_train, y_train)
GridSearchCV(cv=15,
estimator=Pipeline(steps=[('preprocess',
ColumnTransformer(transformers=[('standardscaler',
StandardScaler(),
['radius_mean',
'texture_mean',
'smoothness_mean',
'compactness_mean',
'concavity_mean',
'concave_points_mean',
'symmetry_mean',
'fractal_dimension_mean',
'radius_se',
'texture_se',
'smoothness_se',
'compactness_se',
'concavity_se',
'con...
'concavity_max',
'concave_points_max',
'symmetry_max',
'fractal_dimension_max']),
('drop',
'drop',
['perimeter_mean',
'area_mean',
'perimeter_se',
'area_se',
'texture_se',
'smoothness_se',
'symmetry_se',
'perimeter_max',
'area_max'])])),
('svc', SVC())]),
n_jobs=-1,
param_grid={'svc__C': [0.001, 0.01, 0.1, 1.0, 10, 100],
'svc__gamma': [0.001, 0.01, 0.1, 1.0, 10, 100]},
return_train_score=True)
results = pd.DataFrame(gs.cv_results_)
best_performing = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].sort_values(
by='mean_test_score', ascending=False
).head(10)
heatmap_data = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].copy()
heatmap_data['C'] = heatmap_data['param_svc__C'].astype(str)
heatmap_data['gamma'] = heatmap_data['param_svc__gamma'].astype(str)
heatmap = alt.Chart(heatmap_data).mark_rect().encode(
x = alt.X('gamma:N', title='gamma'),
y = alt.Y('C:N', title='C'),
color = alt.Color('mean_test_score:Q', scale=alt.Scale(scheme='viridis')),
tooltip = ['C', 'gamma', 'mean_test_score']
).properties(
width = 400,
height = 400,
title = 'SVM GridSearchCV Mean Test Scores'
)
best_performing
| | param_svc__C | param_svc__gamma | mean_test_score |
|---|---|---|---|
| 25 | 10.0 | 0.010 | 0.969176 |
| 31 | 100.0 | 0.010 | 0.966667 |
| 30 | 100.0 | 0.001 | 0.960287 |
| 19 | 1.0 | 0.010 | 0.955986 |
| 24 | 10.0 | 0.001 | 0.955914 |
| 20 | 1.0 | 0.100 | 0.955914 |
| 26 | 10.0 | 0.100 | 0.953620 |
| 32 | 100.0 | 0.100 | 0.951470 |
| 18 | 1.0 | 0.001 | 0.931613 |
| 14 | 0.1 | 0.100 | 0.927455 |
heatmap.display()
from sklearn.metrics import classification_report, confusion_matrix
y_pred = gs.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose().drop('support', axis = 1).drop(['macro avg', 'weighted avg'])
report_df
| | precision | recall | f1-score |
|---|---|---|---|
| Benign | 0.986486 | 1.000000 | 0.993197 |
| Malignant | 1.000000 | 0.975610 | 0.987654 |
| accuracy | 0.991228 | 0.991228 | 0.991228 |
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index = gs.classes_, columns = gs.classes_)
cm_melted = cm_df.reset_index().melt(id_vars='index')
cm_melted.columns = ['Actual', 'Predicted', 'Count']
heatmap = alt.Chart(cm_melted).mark_rect().encode(
x = alt.X('Predicted:N', title = 'Predicted'),
y = alt.Y('Actual:N', title = 'Actual'),
color = alt.Color('Count:Q', scale = alt.Scale(scheme ='viridis'))
).properties(
width = 400,
height = 400,
title = 'Confusion Matrix Heatmap'
)
text = alt.Chart(cm_melted).mark_text(color = 'white').encode(
x = 'Predicted:N',
y = 'Actual:N',
text = 'Count:Q'
)
heatmap + text
Discussion:¶
Our model performed very well, achieving high accuracy on the test set and correctly classifying nearly all cases. This result was generally expected given the strong feature patterns observed during EDA, which suggested clear separation between benign and malignant tumours.
The main concern is the single false negative, where a malignant tumour was predicted as benign. Even though this error is rare, it carries significant clinical risk and highlights that the model, while strong, is not yet reliable enough for real-world medical use.
These results suggest future work should explore methods aimed at reducing false negatives, such as adjusting class weights, using cost-sensitive training, or validating on external datasets to assess robustness.
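As a rough illustration of the class-weighting idea (on synthetic data, not the WDBC split used above), `SVC(class_weight=...)` can penalize false negatives more heavily; the dataset, weight value, and class encoding below are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the Benign (0) / Malignant (1) problem.
X_demo, y_demo = make_classification(
    n_samples=600, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Weight class 1 ("malignant") five times more heavily than class 0,
# so the SVM pays a larger penalty for each missed positive.
plain = SVC().fit(X_tr, y_tr)
weighted = SVC(class_weight={0: 1, 1: 5}).fit(X_tr, y_tr)

# Row 1, column 0 of the confusion matrix = false negatives.
fn_plain = confusion_matrix(y_te, plain.predict(X_te))[1, 0]
fn_weighted = confusion_matrix(y_te, weighted.predict(X_te))[1, 0]
print("false negatives, unweighted:", fn_plain)
print("false negatives, weighted:  ", fn_weighted)
```

Weighting trades some false positives for fewer false negatives; in a screening context that trade-off is usually acceptable, but it should be validated with recall-oriented metrics rather than accuracy alone.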
References¶
Reitz, Kenneth. 2011. Requests HTTP for Humans. https://requests.readthedocs.io/en/master/.
American Cancer Society. 2024. “Breast Cancer Facts & Figures.” https://www.cancer.org/cancer/types/breast-cancer.html.
National Cancer Institute. 2024. “Breast Cancer Treatment (PDQ).” https://www.cancer.gov/types/breast/patient/breast-treatment-pdq.
UCI Machine Learning Repository. 2017. “Breast Cancer Wisconsin (Diagnostic) Data Set.” https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).